This vignette describes the workflow of the R package warpDE,
showing three different methods to infer differentially expressed genes
from reconstructed cell lineages. After a first step of cell ordering
reconstruction performed with Slingshot (see (K. Street et al. 2017)), we try to find genes with significantly different behaviours between two cell lineages.
This vignette uses the data from (R. B. Fletcher et al. 2017) of Olfactory Epithelium cells from mouses. We restricted the data to only two cell lineages for simplicity purpose.
Before searching for differentially expressed genes between lineages, we have to preprocess the data and run slingshot on it. We use data from (R. B. Fletcher et al. 2017), that we already have filtered and normalized with Scone. Scone tries different normalization techniques, batch or technical effects removal methods and provides tools to rank this different normalizations.
We load this data from the package warpDE, which contains also the different functions that we use in this study. We also load the package slingshot (K. Street et al. 2017) which allows us to perform the lineage inferrence.
library(warpDE)
library(slingshot)
library(gridExtra)
# For nice colors
library(RColorBrewer)
data("OE_counts")
We perform a usual transformation with log(1+x) and we run PCA as the dimension reduction technique:
pca <- prcomp(t(log1p(counts)))
# nice colors for the clusters
clust_col = brewer.pal(length(unique(clusters)),"Set3")[as.numeric(clusters)]
plot3d(pca$x[,1:3], col = clust_col, size = 5)
We apply slingshot to infer the lineages curves:
slrun <- slingshot(pca$x[,1:3], clusters, start.clus = '1', end.clus = '4')
plot3d(pca$x[,1:3], col = clust_col, size = 5)
plot3d.SlingshotDataSet(slrun, type = "curves", add = T, size = 0.5)
Slingshot draws smooth curves in the 3D representation of the cells.
We can also have a 2D plot of the lineages by projecting on the first two PCs:
# generate 2D captures of the 3d plots for the rmarkdown
par(cex.main = 1)
plot(pca$x[,1:2], col = clust_col, pch=16, asp = 1)
for(c in curves(slrun)){
lines(c$s[c$tag,c(1,2)], lwd = 1)
}
title("View of the first two Pcs of the data with principal curves")
Then, we store the slingshot outputs for our future DE analysis, and we create a new data objects with the desired inputs:
times <- pseudotime(slrun)
weights <- curveWeights(slrun)
# Normalization of the weights (we want convex weights)
w1 <- weights[,1]/(weights[,1] + weights[,2])
w2 <- weights[,2]/(weights[,1] + weights[,2])
w <- cbind(w1,w2)
df <- new("warpDEDataSet", counts = counts, t = times, w = w)
The other dataset in the package warpDEDataSet_example contains exactly this df object, so you are not forced to re-do the whole Slinghshot workflow.
We plot the gene expression profile for the first gene of the dataset: CreER. The points are colored by lineage: red is the first lineage (GBC and neuronal cells) and blue is the second lineage (sustentacular cells).
gene <- "CreER"
reg_gam(df, gene, regression = F)
We fit a separate ‘ns’ (natural spline) regression for each lineage as the alternative model and one ‘ns’ regression including both lineages at the same time (the null model). This gives us a representation of the activation pattern for a given gene.
reg_gam(df, gene, legend.show = T, reg.f = "ns")$pl
The main goal of the analysis is to find genes which are strongly differrentially expressed between the two reconstructed lineages. In a sense we are searching for branching expression pattern, branching curves in our visualization.
grid.arrange(reg_gam(df, 'Cyp2g1')$pl, reg_gam(df, "Myo9a")$pl, ncol = 2)
On the left, we displayed Cyp2g1, a differentially expressed gene with a pattern similar to the ones that we want to detect. On the right, we see an unintersting pattern for us; both genes are expressed similarly in both lineages.
The comparision between the alternative model and the null model can
be computed via likelihood ratio test or with AIC. The function likelihood_rank computes these values and returns a ranking. This ranking function takes a few minutes to run.
The main tuning parameters are:
the choice of regression function: reg.f is either ‘loess’ (from (Cleveland and Devlin 1988)) or ‘ns’ for natural cubic splines;
the smoothing parameter: s.df, the equivalent degrees of freedoms for natural splines and span, the span parameter for the loess.
s.rank <- likelihood_rank(df, 'ns', pval = T, s.df = 4)
s.aic.rank <- s.rank$aic
s.pval.rank <- s.rank$pval
You can then plot the first genes picked by the method by running plot_genes
plot_genes(df, s.aic.rank)
plot_genes(df, s.pval.rank, order = "head", nb.show = 10, grid.size = c(2,5))
Here is also an exemple of the worst genes picked by this method :
plot_genes(df, s.pval.rank, order = "tail")
Another way to infer differentially expressed genes is to consider the gene counts as time series data. Thus, for each gene we simply compare the difference between each lineages with the Dynamic Time Warping distance which allows local distorsions of the time axis. We can also choose the regression and the smoothing parameter, even though the ranking of the genes seems to be less dependent of the tuning parameters for this method. Another important tuning parameter is the window size of the warping, which changes how locally the distorsion is performed.
s.dtw.rank <- dtw_rank(df, reg.f = "ns" , s.df = 4)
plot_genes(df, s.dtw.rank)
We also tried to combine both methods. First, we align the cells of both lineages on a common time scale with DTW. Then we compute their similarity thanks to the likelihood method. As in the other methods the tuning parameters are the choice of the regression and the smoothing parameter, as well as the DTW parameters (see (Giorgino 2009) for more details on DTW options).
We can see the warped pattern by calling dtw_align with align.show = true
gene <- "Prpf8"
grid.arrange(reg_gam(df, gene, reg.f = "ns", s.df = 4)$pl, dtw_align(df, gene, reg.f = "ns", s.df =4, align.show = T)$pl, ncol = 2)
Computation of the combined method :
s.mix.rank <- likelihood_rank(df, reg.f = "ns", s.df = 4, dtw = T)$aic
plot_genes(df, s.mix.rank)
You can compare two rankings with the rankings_compare function which wrapps three different comparisons:
A scatter plot of both lineages;
A comparison of the shared elements in the differents quantiles of the two rankings
A measure of the Kendall’s tau between the two rankings (a measure of the number of concordant and discordant permutations of ranks).
dtw.aic <- rankings_compare(s.dtw.rank, s.mix.rank)
dtw.aic$pl
dtw.aic$shared.q
## 10 50 100 150 200 250 300 350 400 450 500 550
## ns dtw vs ns aic dtw 0.6 0.6 0.56 0.58 0.6 0.64 0.67 0.68 0.7 0.71 0.73 0.74
## 600 650 700 750 800 850 900 950 1000
## ns dtw vs ns aic dtw 0.76 0.78 0.8 0.82 0.85 0.88 0.91 0.95 1
dtw.aic$kendalls.tau
## [1] 0.4311105
We provided kendall.heatmap a quick visualization tool for multiple rank comparisons:
kendall.heatmap(list(s.aic.rank, s.mix.rank, s.dtw.rank))
We also created an visualization of the distribution of the distances used for ranking genes. We plot this as an Elbow curve, which points at a hard threshold for differentially expressed genes based on this distribution.
elbow_curve(s.mix.rank)$pl
Cleveland, William S., and Susan J. Devlin. 1988. “Locally Weighted Regression: An Approach to Regression Analysis by Local Fitting.” Journal of the American Statistical Association 83 (403): 596. doi:10.2307/2289282.
Fletcher, Russell B., Diya Das, Levi Gadye, Kelly N. Street, Ariane Baudhuin, Allon Wagner, Michael B. Cole, et al. 2017. “Deconstructing Olfactory Stem Cell Trajectories at Single-Cell Resolution.” Cell Stem Cell 20 (6). Elsevier Inc.: 817–830.e8. doi:10.1016/j.stem.2017.04.003.
Giorgino, Toni. 2009. “Computing and Visualizing Dynamic Time Warping Alignments in <i>R</i> : The <b>dtw</b> Package.” Journal of Statistical Software 31 (7): 1–24. doi:10.18637/jss.v031.i07.
Street, Kelly, Davide Risso, Russell B Fletcher, Diya Das, John Ngai, Nir Yosef, Elizabeth Purdom, and Sandrine Dudoit. 2017. “Slingshot: Cell lineage and pseudotime inference for single-cell transcriptomics.” BioRxiv. doi:10.1101/128843.